Abstract

Background: The use of classification algorithms is becoming increasingly important for the field of computational 
biology. However, not only the quality of the classification, but also its biological interpretation is important. This 
interpretation may be eased if interacting elements can be identified and visualized, something that requires 
appropriate tools and methods. 

Results: We developed a new approach to detecting interactions in complex systems based on classification. Using 
rule-based classifiers, we previously proposed a rule network visualization strategy that may be applied as a heuristic 
for finding interactions. We now complement this work with Ciruvis, a web-based tool for the construction of rule 
networks from classifiers made of IF-THEN rules. Simulated and biological data served as an illustration of how the 
tool may be used to visualize and interpret classifiers. Furthermore, we used the rule networks to identify feature 
interactions, compared them to alternative methods, and computationally validated the findings. 

Conclusions: Rule networks enable a fast method for model visualization and provide an exploratory heuristic to 
interaction detection. The tool is made freely available on the web and may thus be used to aid and improve 
rule-based classification. 
Background

Technological developments have increased the ability to
generate and store large amounts of data. However, for
the data to be useful, relevant methods for their analysis
are needed. Classification methods are algorithms that
automatically learn from such large data sets; however, the
requirements on such methods are quite high, and the
need for new classification methods has been stressed,
especially for methods that are able to identify
interactions in the data [1-3]. For instance, single nucleotide
polymorphisms (SNPs) found in genome-wide association
studies using traditional statistical analysis can only
explain small fractions of many common diseases [4], and
classifiers using those markers may be of poor quality [5].
It has been suggested that this is due to the lack of gene-gene
and gene-environment interactions in the models [1],
and efforts have been made to develop specific tools, e.g.
for the identification of SNP interactions [6].

Rule-based classifiers are one type of classifier. Their
strength lies in the fact that they are comparably easy to
interpret while still producing models of reasonable
quality, which has made them suitable for applications
in systems biology. Rule-based classifiers have earlier
been applied to a wide spectrum of problems in genomics,
proteomics, and epigenetics, for example to predict gene ontology
terms from gene expression time profiles [7], to interpret
microarray data [8], to model cleavage of polypeptide
octamers by the HIV-1 protease [9], to model ligand-receptor
interactions [10], and to classify Alzheimer's patients [11].

A rule-based classifier consists of a set of IF-THEN rules
that describe the relations in the training data almost in
natural language, based on the original feature names.
Several software packages can generate rules,
including ROSETTA [12] and WEKA [13]. Rule-based
classifiers are non-linear, and the identified rules
may describe important features and interactions in the
data. An intuitive heuristic to identify putative interactions
from a set of rules is to search the rules for combinations
of conditions that occur frequently in them. However, a
classifier typically contains a large number of rules, which
may sometimes be very complex, with five to ten or even
more conditions. Thus, new tools are needed to support
the visualization and interpretation of the rules.

Most attempts to visualize rules have concerned asso- 
ciation rules. For an overview of such visualization tech- 
niques, see for example [14,15]. Software previously 
developed for this task includes the R package arulesViz 
[16] that uses a two-dimensional matrix in which similar 
rules are clustered. However, most methods scale poorly 
with an increased number of rules. We were impressed 
by the readability of the circular graphs produced by the 
Circos software [17] and decided to use it for rule 
visualization. To our knowledge, the only attempt to 
visualize rules in a circular layout was done for associ- 
ation rules by [18]. 

We therefore present Ciruvis: a web-based tool [19] 
for the visualization of conditions that are associated in 
the rules using a circular layout. It relies on a scoring 
system previously introduced by [20], for which we now
provide a free-to-use web-based implementation. The
tool may produce both separate rule networks for each 
decision outcome and a combined network. In this study 
we focused on the detection of interaction effects in 
those networks, although they may also be valuable 
solely for visualization purposes. 

Using different types of simulated data sets, we showed 
that applying our tool to ROSETTA rules may identify in- 
teractions in the data. Furthermore, we applied the tool to 
real data in order to compare it to other methods and to 
illustrate its use. The tool is fast, scales well with the num- 
ber of rules and is easy to use. 

In conclusion, we believe that Ciruvis may facilitate 
visualization of rule-based classifiers and the discovery 
of interactions. 

Methods 

Rule terminology 

A rule describes a relation between the rule conditions
(the left-hand side, LHS, of the rule) and the rule outcome
(the right-hand side, RHS). For example, a rule
taken from a classifier for leukemia based on gene expression
is: IF MIF = 'high' AND GPX1 = 'low' THEN
type = 'chronic lymphocytic leukemia'.

The rule support is the number of objects that fulfill
the LHS of the rule, and the accuracy is the fraction of
those objects that also fulfill the RHS of the rule, or
equivalently, accuracy = P(RHS | LHS). A rule condition
has the form feature = 'value' (for example MIF = 'high')
and a rule may have one or multiple conditions. The
rule outcome has the form class = 'value' and there is
only one such feature.
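
The two quantities can be illustrated with a minimal Python sketch. The function name, the dictionary-based data layout and the toy objects are assumptions made for illustration only; they are not taken from ROSETTA or from the lymphoma data.

```python
def rule_support_and_accuracy(objects, lhs, rhs):
    """Return (support, accuracy) of the rule IF lhs THEN rhs.

    objects: list of dicts mapping feature name -> value (decision included)
    lhs: dict of conditions, e.g. {"MIF": "high", "GPX1": "low"}
    rhs: (decision_feature, value) pair
    """
    # Objects that fulfil every condition on the left-hand side.
    matching = [o for o in objects if all(o.get(f) == v for f, v in lhs.items())]
    support = len(matching)
    if support == 0:
        return 0, 0.0
    decision, value = rhs
    # Fraction of LHS-matching objects that also fulfil the RHS: P(RHS | LHS).
    correct = sum(1 for o in matching if o.get(decision) == value)
    return support, correct / support


# Hypothetical toy objects (illustration only).
toy = [
    {"MIF": "high", "GPX1": "low", "type": "CLL"},
    {"MIF": "high", "GPX1": "low", "type": "CLL"},
    {"MIF": "high", "GPX1": "low", "type": "FL"},
    {"MIF": "low", "GPX1": "low", "type": "FL"},
]
support, accuracy = rule_support_and_accuracy(
    toy, {"MIF": "high", "GPX1": "low"}, ("type", "CLL")
)
```

In this toy case three objects fulfil the LHS and two of them also fulfil the RHS, so the support is 3 and the accuracy is 2/3.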



Definition of the rule network 

Ciruvis is a tool to visualize combinations of rule condi- 
tions that are important for a particular rule outcome. 
Each condition that has at least one connection to an- 
other condition is placed as a node on the outer ring of 
the circle in an alphabetical order. Two conditions are 
connected inside the circle if they co-occur in some rule 
(s). The score of the connection between two conditions, 
x and y, is defined as 

$$\mathrm{connection}(x, y) = \sum_{r \in R(x, y)} \mathrm{support}(r) \cdot \mathrm{accuracy}(r)$$

where R(x,y) is the set of all rules in which x and y co- 
occur. 
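
As a minimal sketch of this definition, the score can be accumulated by iterating over every pair of conditions within each rule. The tuple layout of a rule and the example rules are hypothetical and only serve to illustrate the sum; this is not the Ciruvis source code.

```python
from collections import defaultdict
from itertools import combinations


def connection_scores(rules):
    """Sum support * accuracy over all rules in which two conditions co-occur.

    rules: iterable of (conditions, support, accuracy), where conditions is a
    collection of strings such as "MIF=high".
    """
    scores = defaultdict(float)
    for conditions, support, accuracy in rules:
        # Every unordered pair of conditions in a rule is one connection.
        for x, y in combinations(sorted(set(conditions)), 2):
            scores[(x, y)] += support * accuracy
    return dict(scores)


# Hypothetical rules for a single outcome (illustration only).
rules = [
    (["A=1", "B=0"], 10, 0.9),
    (["A=1", "B=0", "C=1"], 4, 1.0),
    (["B=0", "C=1"], 5, 0.8),
]
scores = connection_scores(rules)
```

Here the pair ("A=1", "B=0") co-occurs in the first two rules and obtains the score 10*0.9 + 4*1.0 = 13, whereas ("A=1", "C=1") only co-occurs in the second rule and obtains the score 4.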

The connections are shown as edges between the 
nodes. The width and color of the edges are related to 
the connection score (low = yellow and thin, high = red 
and thick). The inner ring shows the color of the condi- 
tion on the other side of the connection. The width of a 
node is the sum of all connections to it, scaled so that all
nodes together cover the whole circle. 
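
The scaling of the node widths may be sketched as follows. Expressing the widths in degrees, as well as the function name and the example scores, are assumptions made for this illustration.

```python
from collections import defaultdict


def node_widths(scores, full_circle=360.0):
    """Map each condition to an arc width proportional to its summed scores."""
    totals = defaultdict(float)
    for (x, y), score in scores.items():
        # A connection contributes to both of its end nodes.
        totals[x] += score
        totals[y] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Nodes are returned in alphabetical order, as on the outer ring.
    return {node: full_circle * t / grand_total for node, t in sorted(totals.items())}


# Hypothetical connection scores between three conditions.
widths = node_widths({("A=1", "B=0"): 13.0, ("A=1", "C=1"): 4.0, ("B=0", "C=1"): 8.0})
```

Since every connection is counted once for each of its two end nodes, the widths always sum to the full circle regardless of the number of connections.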

Parameters and user interface 

To run Ciruvis, a rule file must be submitted either in
the ROSETTA format or in a line-by-line format. Several optional
filtering and formatting parameters are available
(Additional file 1: Table S1). A screen shot from the
submission form and the results page are shown in
Additional file 2: Figure S1. One rule network is generated
for each possible outcome. The figures are interactive,
and by clicking on the edge between two conditions,
all rules containing that combination of conditions are 
shown. If the Ctrl key is held while selecting multiple 
edges, the intersections of rules from these edges are 
shown. The name of a node is shown when the mouse is 
hovered over it. It is possible to download the Ciruvis
figure in the Scalable Vector Graphics (SVG) format
and the feature labels as an HTML table, both of which can
be easily edited and used to produce publication-quality
figures.

Generation of simulated data 

We used simulated data to test the ability to detect
interactions using the networks. The dataset was
constructed to contain noise, features correlated to the
decision, and pairs of interacting features. The interacting
features were defined so that they together were
predictive for the decision but that each of them was
uncorrelated to it. Translated into a real-world situation,
this could represent a situation with SNPs of which 
some lack marginal effects on the outcome, but have an 
interaction effect caused by gene-gene interactions or 
epistasis. 

For each data set we defined five correlated features
with expected correlation c = X*i/4, where X was the
maximal correlation for that data set and i = 0, 1, ..., 4.
Each correlated variable was named after the index i and
its correlation c according to Ci_c. Similarly, we defined
five pairs of interacting features which, when taken
together, were predictive for the outcome with the
probability p = Y*i/4, where Y was the maximal value for that
data set and i = 0, 1, ..., 4. The features of the pairs were
named Ri_p and Si_p, where i was an index 0 <= i <= 4
and p was their probability of being predictive.

Each choice of the parameters X and Y thereby represented
one data set with 15 features. In order to generate
datasets with different properties, we allowed X and
Y to take all values in {0.00, 0.05, 0.10, ..., 0.95, 1.00},
which defined 21*21 = 441 datasets. In each dataset 1000
objects were generated using the algorithm below. Note
that the Random() function returns only discrete values and
thus that both the decision and the features are discrete.

CreateObject(X, Y)
    Decision <- Random()
    foreach i (0 <= i <= 4)
        c <- X*i/4
        if Probability(c)
            Ci_c <- Decision
        else
            Ci_c <- Random()
        p <- Y*i/4
        Ri_p <- Random()
        if Probability(p)
            if Ri_p = Decision
                Si_p <- 1
            else
                Si_p <- 0
        else
            Si_p <- Random()



Here Random() is a function that returns 0 or 1 with
equal probability, and Probability(q) is a function that
returns true with probability q and false otherwise. We generated
50 replicate data sets for each combination of X and
Y and trained a classification model on each of those. The
classification accuracies presented were averaged over
those 50 models, and all rules from the replicates were
merged together for Ciruvis to construct an average picture.
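
For readers who wish to reproduce similar data, the CreateObject procedure may be sketched in Python as below. The seeding, the helper create_dataset and the rounding of c and p to whole percentages in the feature names are assumptions made for illustration and are not specified by the procedure itself.

```python
import random


def create_object(x, y, rng):
    """Generate one object following the CreateObject pseudocode."""
    obj = {"Decision": rng.randint(0, 1)}
    for i in range(5):
        c = x * i / 4
        p = y * i / 4
        # Correlated feature: copies the decision with probability c.
        # Naming with whole percentages is an assumption.
        c_name = f"C{i}_{round(100 * c)}"
        obj[c_name] = obj["Decision"] if rng.random() < c else rng.randint(0, 1)
        # Interacting pair: R is pure noise; with probability p, S encodes
        # whether R equals the decision, otherwise S is noise as well.
        r = rng.randint(0, 1)
        if rng.random() < p:
            s = 1 if r == obj["Decision"] else 0
        else:
            s = rng.randint(0, 1)
        obj[f"R{i}_{round(100 * p)}"] = r
        obj[f"S{i}_{round(100 * p)}"] = s
    return obj


def create_dataset(x, y, n_objects=1000, seed=0):
    """Hypothetical helper: one data set for a given pair (X, Y)."""
    rng = random.Random(seed)
    return [create_object(x, y, rng) for _ in range(n_objects)]


data = create_dataset(x=0.0, y=1.0)
```

With X = 0 and Y = 1, the strongest pair is fully predictive only in combination: S4 equals 1 exactly when R4 equals the decision, while R4 on its own is generated independently of the decision.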

Rule-based classification using ROSETTA 

The rule-based classifiers were constructed using the 
ROSETTA toolkit for analysis of tabular data [12,21]. 
ROSETTA is a mathematical framework capable of de- 
riving IF-THEN rules from a set of training examples. 



Boolean reasoning is used to compute minimal sets of
features, called reducts, that discriminate between the
training examples as well as the full set of features does. Based
on the feature values in the training data, the reducts are 
transformed into rules that describe minimal sets of fea- 
ture conditions associated with a particular decision 
class. Combined, these rules may be used to classify pre- 
viously unseen objects. 
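
The notion of a reduct can be illustrated by an exhaustive search on a small, consistent decision table. This brute-force sketch is not the algorithm used by ROSETTA; it only demonstrates what a minimal discerning feature set is, and all names and the toy table are hypothetical.

```python
from itertools import combinations, product


def discerns(table, decisions, features):
    """True if no two objects with different decisions agree on all `features`."""
    seen = {}
    for row, decision in zip(table, decisions):
        key = tuple(row[f] for f in features)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def minimal_reducts(table, decisions):
    """Exhaustively find the smallest feature subsets that discern all classes."""
    all_features = sorted(table[0])
    if not discerns(table, decisions, all_features):
        raise ValueError("the full feature set cannot separate the decision classes")
    for size in range(1, len(all_features) + 1):
        found = [subset for subset in combinations(all_features, size)
                 if discerns(table, decisions, subset)]
        if found:
            return found
    return []


# Toy table: the decision is 1 when R and S are equal; N is noise.
table = [{"R": r, "S": s, "N": n} for r, s, n in product((0, 1), repeat=3)]
decisions = [int(row["R"] == row["S"]) for row in table]
reducts = minimal_reducts(table, decisions)
```

In this toy table no single feature separates the two decision classes, and the only minimal discerning subset is the pair (R, S), which is analogous to the interacting pairs in the simulated data.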

Algorithms and parameters are described briefly in the
results section and in more detail in the Supplementary 
methods (Additional file 3: Supplementary methods). The 
quality of each classifier was measured by the classifier ac- 
curacy (the proportion of correctly classified objects) 
which was estimated using 10-fold or leave-one-out cross 
validation. 

Results and discussion 

Detection of correlated versus interacting features in 
simulated data 

To investigate how well the rule networks from Ciru- 
vis could detect feature interactions, we first tested it 
using simulated data. The data contained both fea- 
tures correlated to the decision and pairs of interact- 
ing features predictive for the decision. The level of 
correlation and pairwise predictability was determined 
by two parameters that defined a maximum level for the 
most predictive feature/pair in the dataset. The maximum 
level of correlation, X, and of interaction, Y, was varied
between 0 and 1. Then, for each data set the number of
correctly classified objects was counted (Additional file 4:
Figure S2). As expected, there were usually more correctly 
classified objects when the features were more predictive 
(as measured by higher X and/or Y). Surprisingly, a higher 
level of interaction increased or at least retained the classi- 
fication quality, whereas a higher correlation sometimes 
decreased the quality. Specifically, the quality was de- 
creased when the pairwise correlation was high and the 
correlation increased over 0.20-0.30. When the interaction 
level was 1.00 this was the most evident, since the average 
number of correctly classified objects decreased from 
998-999 out of 1000 for X< 0.25 to a local minimum 828 
at X= 0.45. 

This suggests that the rule generation algorithm was 
biased towards finding rules containing features corre- 
lated to the decision. When the correlated features were 
not present, then the combinatorial rules of higher qual- 
ity were more likely to be found. The identified masking 
became one of the focuses in our study. 

Next, we investigated the behavior of the rule networks
for different datasets (Figure 1). Since both the
features and the decision were binary, only the networks
for outcome "0" are presented. Based on the data generation
algorithm, opposite values of the R and S variables
were expected to predict the "0" decision, e.g., IF R = 0
AND S = 1 THEN DEC = 0, whereas equal values predict
the "1" decision. The aim was to observe how small
interactions could still be detected and to learn about their
properties; for instance, whether they would be masked
by features strongly correlated to the decision.

Figure 1 Rule networks for simulated data. Rule networks for twelve different pairs of maximum correlation X and interaction Y for the "0"
outcome. The parameter choices (A-L) correspond to points in Additional file 4: Figure S2. The correlated features are named C0 to C4 (lowest to
highest correlation), and the pairs R0, S0 to R4, S4 (lowest to highest correlation). The colors were specified so that the interacting pairs have the
same color. Each feature occurs twice in the figure; the first time with the value 0 and the second with 1.



Using X = 0.00 and Y = 0.10 we could identify visible
connections between pairs with an interaction level at
10, 8, and 5% (Figure 1A). The connections between
"R4_10" and "S4_10" were the two strongest in the figure,
demonstrating that very weak interactions may be
detected in the Ciruvis networks even in the presence of
very noisy data. This particular example also illustrated
that the rules from a classifier may be informative, even
when the quality of classification is essentially not better
than "random guessing".

In the following runs we processed datasets with a small 
background correlation, X=0.15 (Figure 1B-H). With Y= 
0.10 the pair with a 10% chance of interaction was barely 
visible, and not among the highest scored connections in 
the figure (Figure 1B). As Y was increased the two (or
three) highest scored pairs became step-by-step more vis- 
ible (Figure 1C-E) and when Y was set to 0.55 or higher 
the three most interacting pairs were by far the strongest 
connections (Figure 1F-G), with the exception of Y= 1.00 
when the third pair (R2 + S2) was masked by the more 
predictive pairs (Figure 1H). 

Similarly, when the best interaction was 100% predictive
(Y = 1.00) and with higher correlation (X = 0.35 or X =
0.45, respectively), the strongest interacting pair was
highly visible and the second pair had indeed a visible
connection, but it was on the same level as some of the
noise (Figure 1I-J). Although it is useful to know that
stronger rules may mask weaker ones, masking caused by 
perfect correlation would normally not be expected in a 
real data set. 

When the dataset had both a high level of correlation 
and interactions, the connections for the two strongest 
interacting pairs were visible, but not the strongest con- 
nections (Figure 1K-L). However, the true interactions are 
shown as connections from conditions with otherwise few 
and weak connections, while connections that are artifacts 
caused by combinations of correlated features originate from
conditions with a lot of strong connections. 

An observation in all of the generated rule networks 
was that at most three (out of four non-zero) interacting 
pairs appeared in the networks. A likely explanation is 
that the stronger interactions mask the weaker ones, 
similarly to how strong correlations do. 

Removal of correlated features decreased the masking of 
weak interactions 

In the previous section we showed that when features 
correlated to the decision were roughly as strong or 
stronger than the interacting pairs, the latter were 
masked by the former. Subsequently, rules containing 
the interacting pairs were rarely found or barely visible 
in the rule networks. To investigate whether the removal
of correlated features from the data would benefit the
detection of the pairs, we used the data from Figure 1B
(in which the pairs are heavily masked) and removed the 
correlated features C4 and C3 (15% and 11% correlation, 
respectively). The pair with the highest interaction (R4 + 
S4, with interaction frequency 10%) subsequently became
relatively stronger (Figure 2A-B). For instance, in
Figure 2A the connection score between "S4 = 1" and
"R4 = 0" is 0.7% of the total score in the figure, which
increases to 1.8% in Figure 2B, becoming the strongest
connection in the figure. The increase for the combination
"S4 = 0" and "R4 = 1" was smaller but still substantial, from
0.6% to 1.1%. In addition, in Figure 2B the "R3 = 0" and
"S3 = 1" pair could be identified (increased from 0.4% to
0.7%), although the connection was still weak. When the
last two correlated features (C2 and C1 with 8% and 4%
correlation, respectively) were removed as well, the
strength of the first and the second pair increased sharply
(to 4.3% and 1.4%, respectively) (Figure 2C).

Comparison to other methods using real data 

In order to compare the interaction detection to other 
methods, and to apply the methodology to real data, we 
used the California Housing [22] dataset downloaded 
from [23]. This dataset was chosen as it had previously
been the subject of interaction detection [24].

California Housing describes housing value based on
1990 census data in California. The decision is the median
value of a block group (medianHouseValue) and there are
8 features. We discretized the decision into three groups:
one group of houses valued >500 000, which was encoded
as '2', and the remaining houses, which were split at their median into
the intervals 0-173 600 and 173 601-499 999 (encoded as
'0' and '1', respectively). We used the features longitude,
latitude, housingMedianAge, totalRooms, population, and
medianIncome previously selected by [24] to build a rule-based
model using ROSETTA. The numeric features were
discretized using EqualFrequencyBinning with 4 intervals.
The model accuracy was estimated using 10-fold cross
validation.
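
The two discretization steps may be sketched as follows. The cut-point rule used here (taking the value at every quarter of the sorted data) is a simple assumption and may differ in detail from the EqualFrequencyBinning implementation in ROSETTA. The example values are hypothetical, and a value of exactly 500 000, which the text leaves unspecified, falls into class 1 in this sketch.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, n_bins=4):
    """Cut points that split the sorted values into n_bins equally sized groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def to_bin(value, cuts):
    """Index of the half-open interval [cut_k, cut_k+1) that contains value."""
    return bisect_right(cuts, value)


def discretize_house_value(value):
    """Three decision classes: 0 (up to 173 600), 1 (173 601 and above), 2 (>500 000)."""
    if value > 500_000:
        return 2
    return 0 if value <= 173_600 else 1


# Hypothetical feature values (illustration only).
population = [300, 700, 900, 1200, 1500, 1800, 2500, 4000]
cuts = equal_frequency_cuts(population)
bins = [to_bin(v, cuts) for v in population]
```

The resulting intervals are closed to the left and open to the right, in the same way as the intervals shown for the conditions in the rule networks.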

The medianIncome feature was highly correlated to
the decision (r = 0.61; Additional file 5: Figure S3) and
when the rule-based model was built to include it, it
dominated the strongest connections (Additional file 6:
Figure S4). An alternative model was built excluding
medianIncome, which reduced the accuracy of the model
from 72.4% to 66.5% as important information was excluded,
but made the identification of interacting pairs
easier. Inspecting the rule networks (Figure 3), we identified
the ten strongest connections for each outcome
(Additional file 7: Table S2). For instance, for
medianHouseValue = 0 three of these described combinations of
conditions with specific values for latitude and longitude,
three combinations with population and totalRooms, two 
with population and longitude, and two with totalRooms 
and longitude. For each one of these specific combinations 
of features, we computed whether it had a significant 
interaction effect (see Additional file 3: Supplementary 
methods for details). Additionally, we computed the expected
accuracy (Additional file 7: Table S2) by first estimating
the effect of each condition separately and then
combining these effects under a multiplicative model (see
[25] for a mathematical derivation). The interaction effects
could then be assessed by comparing the observed and
the expected accuracies.

Figure 2 Correlated features mask weak interactions. Rule networks for the outcome "0" in the simulated data. The data parameters are
X=0.10, Y=0.15. (A) Using all features, (B) after the removal of the two strongest correlated features C4_15 and C3_11, and (C) after the removal
of the four strongest correlated features C1_4-C4_15. Connections between interacting features were colored black.

Out of the ten strongest connections for
medianHouseValue = 0, three described significant interactions.
For instance, "population = [1167, 1726) AND
totalRooms = [1448, 2127)" had an accuracy of 67.8%
compared to an expected 51.7%. This increase in accuracy
is due to a specific interaction between the population
in the area and the total number of rooms.
Presumably, the number of rooms per capita is what
determines the house prices.

In conclusion, an interaction between population and
totalRooms was described by several connections. Additionally,
a specific combination of latitude and longitude
described an interaction predictive for low house prices,
and a combination of high housingMedianAge and high
totalRooms described an interaction predictive for very
high house prices. Two of these pairs were reported as
interacting by [24], but the third one is novel. The interaction
between latitude and longitude was very strong in
the previous study and it indeed appeared in several of
the strongest connections. However, only one specific
combination of conditions showed a significant interaction
effect. This is most likely because these two features
are strongly correlated (r = -0.92; Additional file 5:
Figure S3), so that the assumption of independent effects
underestimated their interaction.

Applications to leukemia and lymphoma 

Finally, we applied Ciruvis to biological data describing
leukemia [26] and lymphoma [27]. The leukemia set contained
gene expression for 7129 genes from 38 patients divided
into two different outcomes: acute lymphoblastic
leukemia (ALL; n = 27) and acute myeloid leukemia
(AML; n = 11). The lymphoma set contained 4026 genes
from 62 patients divided into three outcomes: lymphoma
and leukemia (DLCL or D; n = 42), follicular lymphoma
(FL or F; n = 9) and chronic lymphocytic leukemia (CLL
or C; n = 11). The probe names were changed into gene
names when possible and otherwise kept as in the source
data. A single quote was used to discern between multiple
probes matching the same genes. Since most genes had
their expression discretized into two intervals by ROSETTA
(see Additional file 3: Supplementary methods for
details on the discretization), the intervals were renamed
into "low" and "high", with the addition of "medium" if applicable.
See Additional file 8: Table S3 and Additional file 9:
Table S4 for details on gene names and values.

Figure 3 Rule networks for regression data. Rule networks for the California housing data after removal of the medianIncome feature. The
features are indicated by node color, and the condition values are shown in increasing order (low, middle-low, middle-high, high) on the circle.
The panels show medianHouseValue = 0 (low), medianHouseValue = 1 (high) and medianHouseValue = 2 (very high).

Feature             low             middle-low           middle-high          high
housingMedianAge    [*, 19)         [19, 29)             [29, 37)             [37, *)
latitude            [*, 33.94)      [33.94, 34.26)       [34.26, 37.72)       [37.72, *)
longitude           [*, -121.79)    [-121.79, -118.49)   [-118.49, -118.00)   [-118.00, *)
population          [*, 788)        [788, 1167)          [1167, 1726)         [1726, *)
totalRooms          [*, 1448)       [1448, 2127)         [2127, 3148)         [3148, *)

Firstly, we used Monte Carlo feature selection [28] to
rank the genes by significance. After correcting for multiple
testing, there were 701 significant (p < 0.05) genes
for leukemia and 512 for lymphoma. Details about the
feature selection are described in the Supplementary
methods (Additional file 3: Supplementary methods). A
principal component analysis (PCA) verified that, using
the 30 most significant features, the outcomes were separable
by the first two principal components (Figure 4).
Missing values were replaced by the gene average during
the PCA. Performing a disease association analysis using
WebGestalt [29], we could confirm that the top ten disease
associations of the selected genes contained annotations
related to lymphoma and leukemia. For example,
the leukemia data were enriched for genes related to
Lymphoid Leukemia (LYN, CCND3, TCF3, CD33, and
MYB; adjP = 0.024) and the lymphoma data for Acute Myeloid
Leukemia (CALR, SUMO, and MYB; adjP = 0.18)
and Acute Erythroblastic Leukemia (PCBP2 and MYB;
adjP = 0.18). The p-values were calculated by WebGestalt
using the hypergeometric distribution and adjusted
with Bonferroni correction.

Next, we used ROSETTA to train a rule-based classifier 
based on the selected features. The accuracy of the classifier 
was 100% for both data sets, estimated by leave-one-out 
cross validation. Details on the classification are de- 
scribed in the Supplementary methods (Additional file 3: 
Supplementary methods). 

Since each rule set in the leave-one-out cross valid- 
ation was trained from all objects except one, they are 
expected to be very similar to rules trained on the whole 
data. Therefore, instead of repeatedly training a classifier 
on the whole data, we merged all the rules from the 
cross validation iterations. Duplicates were removed and 
the rules were filtered so that rules that are supersets of 
other rules were removed if they had lower significance 
(hypergeometric distribution); for details on the p-value
calculations, see [30]. The motivation behind the filtering
strategy is that shorter rules are preferred if they are
at least as significant as their longer counterparts.

Figure 4 Feature selection for leukemia and lymphoma. The separation of the outcomes (disease types) using the first two principal
components was improved when the 30 most significant features were used instead of all features. The figures show (A) lymphoma before,
(B) lymphoma after, (C) leukemia before, and (D) leukemia after feature selection, respectively.

The filtered set of rules was submitted to Ciruvis using
default parameters. The interactive rule networks are
available online at [31].
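
The rule filtering described above may be sketched as follows. The one-sided hypergeometric tail used for the p-value is an assumption made for illustration (the exact calculation is given in [30]), and the rule representation and example rules are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_total, n_class, support, correct):
    """P(X >= correct) when `support` objects are drawn without replacement
    from n_total objects of which n_class belong to the rule outcome."""
    upper = min(support, n_class)
    tail = sum(
        comb(n_class, k) * comb(n_total - n_class, support - k)
        for k in range(correct, upper + 1)
    )
    return tail / comb(n_total, support)


def filter_rules(rules):
    """Drop duplicates, then drop rules that extend an at least as significant rule.

    rules: list of (conditions, outcome, p_value) with conditions as a frozenset.
    """
    unique = list(dict.fromkeys(rules))
    kept = []
    for conds, outcome, p in unique:
        # A rule is removed if a strict subset of its conditions forms a rule
        # for the same outcome with an equal or lower p-value.
        dominated = any(
            other_outcome == outcome and other < conds and other_p <= p
            for other, other_outcome, other_p in unique
        )
        if not dominated:
            kept.append((conds, outcome, p))
    return kept


# A rule covering exactly the 11 AML patients among the 38 leukemia patients.
p_aml = hypergeom_pvalue(n_total=38, n_class=11, support=11, correct=11)

# Hypothetical rules (illustration only).
example = [
    (frozenset({"A=high"}), "AML", 0.001),
    (frozenset({"A=high", "B=low"}), "AML", 0.01),
    (frozenset({"A=high", "C=low"}), "AML", 0.0001),
    (frozenset({"A=high"}), "AML", 0.001),
]
kept = filter_rules(example)
```

In the example the duplicate and the longer rule with the higher p-value are removed, whereas the longer rule that is more significant than its shorter counterpart is retained.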

The rule network for leukemia is shown in Figure 5.
The difference in the overall topology of the networks for
ALL and AML may partly be explained by a different
number of rules for each outcome (48 for ALL and 254
for AML). Direct comparison between the networks was
therefore difficult, since the same width would relate to a
different number of rules. Instead we studied the strongest
connections in each network. For this dataset both networks
were quite simple, with all connections supported
by only one high-quality rule. For ALL the highest scoring
connections were based on any pair of the following conditions:
SPTAN1 = high, PTX3 = low, and CFP = low; the
conditions SPTAN1 = high and CFP = low were the most
frequent in other rules as well. Had the set of patients
been larger, noiseless relationships would likely have been
harder to identify and Ciruvis might have helped us identify
the most important pairs out of more complicated
rules. The AML network showed the same property, with
a large number of connections based on only one rule
with a pair of conditions. Most likely, the reason why
more combinations were found in this network was that
no single condition constituted a high-quality rule in itself,
which forced the generation of longer rules.

Similar behavior was observed in some of the rule net- 
works for the lymphoma data (Figure 6). For CLL many 
connections were based on only one rule. The strongest 
connection (between MIF = low and GPX1 = high) was 
based on four rules. This combination corresponded to a 
rule with 73% accuracy, compared to an expected accur- 
acy of 51% assuming independent and multiplicative ef- 
fects, which indicated that an interaction could be 
present. The second strongest connection was between 
NT5C2 = low and GPX1 = high which showed an accur- 
acy of 84% compared to the expected 55%. A three-way 
interaction could be hypothesized and tested between 
NT5C2 = low, MIF = low and GPX1 = high with accuracy 
of 92% compared to the expected 83%. 



Figure 5 Rule networks for leukemia. Rule networks showing which rule conditions are associated for leukemia. All connections are based
on one rule each and are therefore of roughly equal score. The labels on each side of the figure are written in the same order as the conditions
appear in the figure.

Figure 6 Rule networks for lymphoma. Rule networks showing which rule conditions are associated for lymphoma. The labels on each
side of the figure are written in the same order as the conditions appear in the figure.

The connections for the next outcome, DLCL, were
supported only by one rule of high quality. Apparently, 
adding more conditions did not yield a significant increase 
in the rule quality. Notably, there are groups of conditions 
in the network that are interchangeable in certain rules. 
For instance, CXCL9 = high may be combined with either 
of PRKCB = low, PRKCB' = low, MXI1 = low, HDGF = 
high, TUBE = high and GENE669X = high to produce a 
rule for DLCL supported by all of the 42 patients in that 
group and with 100% accuracy. If instead GPX1 = high is 
combined with any of the six genes the second highest 
scoring connections are achieved with rules that are al- 
most as good; supported by 41 patients and with 100% 
accuracy. 

For the FL outcome, a hypothesized three-way interaction
between GENE1625X = low, MIF = low and NT5C2 = low
had to be rejected as the combined accuracy was lower
than predicted. Pairs of these conditions separated
FL + CLL from DLCL, and together with any of several
other conditions they defined three-way interactions.

Conclusions 

The requirements on classification methods to be user-friendly 
and easy to interpret have increased over the 
past years. In that respect, rule-based classifiers, which 
consist of IF-THEN "sentences" (or rules), make the 
models comparatively easy to interpret. However, when the 
model has too many rules to be conveniently read, 
methods for visualizing the rules become important. 
We developed a web-based tool for rule visualization 
that is compatible with any type of classification rules. 
Its primary use is to provide a fast and easy visualization 
of a rule-based classifier. However, interpreting the rule 
networks can also help to generate hypotheses about feature 
interactions, which was the main focus of this study. 
A limitation of rule-based models is that the attributes 
have to be discrete, but discretization techniques help 
overcome this. 

Using simulated data, we showed that the ROSETTA 
software may be used to construct rules that describe in- 
teractions even if the features lack marginal effects. Yet 
the rule detection may be biased towards features 
strongly correlated to the decision. We modeled different 
trade-offs between correlated and interacting features, and 
demonstrated to what degree stronger associations mask 
weaker ones. 

The masking is a consequence of the classification algorithm, 
which is biased towards using the most predictive 
features for classification and omits weaker but still 
predictive features or feature combinations. This becomes a 
problem when the interpretation of the classifier is important. 
To detect masking features, correlations between 
each feature and the decision may be computed, or Ciruvis 
may be used to identify nodes with connections to almost 
all other nodes. We introduced a strategy in which the 
features most strongly correlated to the decision are removed 
from the data and the model is re-generated, so that 
weaker interactions gain importance for the 
classifier and the Ciruvis network. 
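
The removal step of this strategy might be sketched as follows. The sketch assumes numerically coded features and uses the absolute Pearson correlation as the ranking criterion, which is an assumption since the correlation measure is not specified here; the helper names `pearson`, `rank_by_correlation` and `drop_top_features` and the toy columns are hypothetical. Re-generating the rule model on the reduced data is left to the rule learner.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns: Dict[str, List[float]],
                        decision: List[float]) -> List[str]:
    """Feature names sorted by decreasing absolute correlation with the decision."""
    scores = {name: abs(pearson(values, decision)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def drop_top_features(columns: Dict[str, List[float]], decision: List[float],
                      n_drop: int = 1) -> Tuple[Dict[str, List[float]], List[str]]:
    """Remove the n_drop features most strongly correlated with the decision."""
    removed = rank_by_correlation(columns, decision)[:n_drop]
    reduced = {name: values for name, values in columns.items() if name not in removed}
    return reduced, removed


# Hypothetical toy data: feature "a" equals the decision and would mask "b" and "c".
toy_columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [1, 0, 1, 0]}
toy_decision = [0, 0, 1, 1]
reduced_columns, removed_features = drop_top_features(toy_columns, toy_decision)
```

The reduced table would then be passed back to the rule learner, and the resulting network compared with the original one.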

An important difference as compared to other methods 
for interaction detection is that the rule networks are 
based on feature-value pairs (conditions) that tell us more 
precisely what feature values are involved in the interac- 
tions. Although not all the connections that were found in 
the networks were true interactions, the rule network is a 
fast method to generate a set of hypotheses to be further 
validated using other methods and new data. 
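
How a network of conditions can be derived from a set of rules may be sketched as follows. The sketch assumes that each pair of conditions co-occurring in a rule forms a connection and that the connection score is the sum of support times accuracy over the rules containing the pair; this scoring, the function name `build_rule_network` and the rule dictionary layout are illustrative assumptions rather than the Ciruvis implementation. The two DLCL rules reuse the support and accuracy values quoted in the text, while the FL rule uses made-up numbers.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Condition = Tuple[str, str]   # (feature, value)
Edge = Tuple[Condition, Condition]


def build_rule_network(rules: List[dict], outcome: str) -> Dict[Edge, float]:
    """Score every pair of conditions that co-occur in rules for one outcome.

    Assumed scoring: each rule adds support * accuracy to all of its
    condition pairs. Pairs are stored in sorted order so that (a, b) and
    (b, a) are counted as the same connection.
    """
    edges: Dict[Edge, float] = defaultdict(float)
    for rule in rules:
        if rule["outcome"] != outcome:
            continue
        weight = rule["support"] * rule["accuracy"]
        for a, b in combinations(sorted(rule["conditions"]), 2):
            edges[(a, b)] += weight
    return dict(edges)


toy_rules = [
    {"conditions": [("CXCL9", "high"), ("MXI1", "low")],
     "outcome": "DLCL", "support": 42, "accuracy": 1.0},
    {"conditions": [("GPX1", "high"), ("MXI1", "low")],
     "outcome": "DLCL", "support": 41, "accuracy": 1.0},
    # Hypothetical rule with made-up numbers, included only to show filtering.
    {"conditions": [("MIF", "low"), ("NTSC2", "low")],
     "outcome": "FL", "support": 5, "accuracy": 0.8},
]

dlcl_network = build_rule_network(toy_rules, "DLCL")
```

Because the nodes are feature-value pairs rather than features, a connection such as CXCL9 = high with MXI1 = low states directly which values are involved.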

In a comparison using data that have previously been 
used for interaction detection, we could identify both 
the reported interactions and a possibly novel one. Sur- 
prisingly, the strongest interaction previously reported 
(longitude and latitude) was found several times in the 
network, but appeared as significant only once. This 
interaction was based on two strongly correlated features 
that contradicted the assumption of independent effects. 

Finally, we applied the tool to leukemia and lymphoma 
data. Our classification was very successful with 100% 
accuracy in the cross validation for both outcomes, simi- 
larly to what has been reported previously using multiple 
classification techniques [26,28]. The rule visualization 
provided a fast overview of the rule models and showed 
that there was very little overlap of conditions between 
the rules. This was likely caused by the small number of 
objects, which allowed the individual rules to be of high 
quality, so that the rule-generation algorithm did not 
need to construct a set of partly overlapping rules. 
Using the rule networks we were able to observe several 
possible interactions, of which many were computation- 
ally validated on our data. We believe it would be worth 
studying those interactions further and ultimately to val- 
idate them experimentally. 

By making Ciruvis freely available on the web [19], 
we hope that it will benefit further research on rule-based 
classifiers and interactions. Additionally, since decision 
trees are commonly used and may be translated 
into rules, applying the tool to decision trees 
would provide an interesting extension. 
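
The translation of a decision tree into rules can be sketched by emitting one IF-THEN rule per root-to-leaf path. The nested-dictionary tree layout, the function names `tree_to_rules` and `format_rule`, and the toy tree are hypothetical; this is not a feature of the tool described here.

```python
from typing import List, Tuple

Condition = Tuple[str, str]          # (feature, value)
Rule = Tuple[List[Condition], str]   # (conditions, outcome)


def tree_to_rules(node: dict, path: Tuple[Condition, ...] = ()) -> List[Rule]:
    """Collect one rule per root-to-leaf path of a categorical decision tree.

    Assumed layout: a leaf is {"outcome": label}; an internal node is
    {"feature": name, "branches": {value: child_node, ...}}.
    """
    if "outcome" in node:
        return [(list(path), node["outcome"])]
    rules: List[Rule] = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, path + ((node["feature"], value),)))
    return rules


def format_rule(rule: Rule) -> str:
    """Render a rule as an IF-THEN sentence."""
    conditions, outcome = rule
    lhs = " AND ".join(f"{feature} = {value}" for feature, value in conditions)
    return f"IF {lhs} THEN {outcome}"


# Hypothetical toy tree with two features.
toy_tree = {
    "feature": "A",
    "branches": {
        "low": {"outcome": "0"},
        "high": {
            "feature": "B",
            "branches": {"low": {"outcome": "0"}, "high": {"outcome": "1"}},
        },
    },
}

tree_rules = tree_to_rules(toy_tree)
rule_sentences = [format_rule(r) for r in tree_rules]
```

Rules obtained this way share their leading conditions, so the conditions near the root would be expected to appear as highly connected nodes in the resulting network.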

Additional files 



Additional file 1: Table S1. Description of parameters and possible 
values for the rule submission form. 

Additional file 2: Figure S1. (A) Ciruvis submission form. (B) Ciruvis 
figure for the selected outcome "1" (high). Rules for the selected 
connection between totalRooms = [3148,*) and medianIncome = [4.7435,*) 
are shown to the right. 

Additional file 3: Supplementary methods. Supplementary 
description of the methods. 
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Additional file 4: Figure S2. The number of correctly classified objects 
varied for different maximal correlation (X) and level of interaction (Y). 
The points A-L here represent the different parameter choices in Figure 1. 
The average standard error of the number of correctly classified objects 
in the replicates with the same X and Y was 12.2 (95% CI 0.0-22.5), with 
datasets with the lowest X and Y having the highest variation. 

Additional file 5: Figure S3. Correlations between pairs of features and 
the decision in the California Housing dataset are displayed in the upper half 
as filled circles with size relative to the correlation and in the lower half 
as values. Positive correlations are colored from white to blue (highest) 
and negative correlations from white to red (highest). 

Additional file 6: Figure S4. Rule networks for the California Housing 
data including the medianIncome feature. The color of each node shows 
which feature it represents, and the condition values are shown in increasing 
order (low, middle-low, middle-high, high) on the circle. 

Additional file 7: Table S2. Calculation of relative risks (RR) and their 
confidence intervals (CI) for each of the ten strongest connections for 
each outcome, as well as the expected (exp) values. Connections that 
had an RR significantly greater than what would be expected assuming 
independent effects are marked with a yellow background and may 
indicate interaction effects. An asterisk '*' in the intervals denotes + or - ∞. 

Additional file 8: Table S3. The 30 most significant features for the 
lymphoma data (p-values calculated by MCFS). The original name refers to 
the internal name in the source data set. The gene name is given 
whenever it was available. The ranges of the discretized expression values 
are given as Low and High. 

Additional file 9: Table S4. The 30 most significant features for the 
leukemia data (p-values calculated by MCFS). The original name refers to 
the internal name in the source data set. The gene name is given 
whenever it was available. The ranges of the discretized expression values 
are given as Low, Medium (if applicable) and High. 
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